Deadline: Mar 13th, 23:00
Academic Integrity
This project is individual - it is to be completed on your own. If you have questions, please post your query in the APS1070 Piazza Q&A forums (the answer might be useful to others!).
Do not share your code with others, or post your work online. Do not submit code that you have not written yourself. Students suspected of plagiarism on a project, midterm or exam will be referred to the department for formal discipline for breaches of the Student Code of Conduct.
In this project we work on a COVID-19 vaccination dataset for different countries at the end of each day.
Please fill out the following:
Download your notebook: File -> Download .ipynb
Click on the Files icon on the far left menu of Colab
Select & upload the .ipynb file you just downloaded, then right-click it to obtain its path (you might need to hit the Refresh button before your file shows up)
Execute the following in a Colab cell:
%%shell
jupyter nbconvert --to html /PATH/TO/YOUR/NOTEBOOKFILE.ipynb
An HTML version of your notebook will appear in the files, so you can download it.
Submit both `HTML` and `IPYNB` files on Quercus for grading.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
from sklearn.preprocessing import StandardScaler
from IPython import display as show
sns.set_theme(context='notebook', style='darkgrid', palette='brg', font='sans-serif', font_scale=1.5, color_codes=True)
cases_raw = pd.read_csv(filepath_or_buffer = 'https://raw.githubusercontent.com/APS1070-W23/datasets/main/country_vaccinations_wide.csv',
index_col=0)
# cases_raw
Write a function to do the following: [0.25]
Apply StandardScaler to the data. Each day should have a mean of zero and an StD of 1. [0.25]
Repeat step 1 on the standardized dataset for the United States, China, and Canada. [0.25]
# Step 2: Standardizing the data
scaler = StandardScaler()
cases_std = scaler.fit_transform(cases_raw)
cases_std = pd.DataFrame(cases_std, columns = cases_raw.columns, index = cases_raw.index)
# cases_std
print('Raw cases: Mean - 10th Column : {0:.4f}'.format(cases_raw.iloc[:,10].mean()))
print('Raw cases: StD - 10th Column : {0:.4f}'.format(cases_raw.iloc[:,10].std()))
print('Standardized cases: Mean - 10th Column : {0:.4f}'.format(cases_std.iloc[:,10].mean()))
print('Standardized cases: StD - 10th Column : {0:.4f}'.format(cases_std.iloc[:,10].std()))
Raw cases: Mean - 10th Column : 0.0090
Raw cases: StD - 10th Column : 0.1339
Standardized cases: Mean - 10th Column : -0.0000
Standardized cases: StD - 10th Column : 1.0022
# Step 1 & 3: Common Function Defined
def country_time_series(df, df_std, country_list, std = False):
    # Pick the raw or the standardized dataset; the plotting logic is shared
    data = df_std if std else df
    title = 'Standardized Country Time Series' if std else 'Country Time Series'
    fig, axs = plt.subplots(len(country_list), 1, figsize=(28, 16), dpi = 100)
    for i, country in enumerate(country_list):
        df_country = data[data.index == country].sum()
        axs[i].plot(df_country.index, df_country.values)
        axs[i].set_title(country)
        axs[i].set_xlabel('Date')
        axs[i].set_ylabel('Total Cases')
        axs[i].xaxis.set_major_locator(plt.MaxNLocator(20))
    plt.tight_layout()
    print(title)
    plt.show()
country_time_series(cases_raw, cases_std, ['United States', 'China', 'Canada'], std = False)
country_time_series(cases_raw, cases_std, ['United States', 'China', 'Canada'], std = True)
Country Time Series
Standardized Country Time Series
Discuss the trends in the standardized time-series for the United States, Canada, and China. What does it mean if the curve goes up or down (are the number of vaccinations negative?) What does the sign of values indicate? [0.25]
For the raw cases: the curves show cumulative vaccination counts, so they are non-negative and non-decreasing.
For the standardized cases: the curve going up or down does not mean the number of vaccinations is negative. Since each day is standardized across countries, the curve tracks a country's vaccination level relative to the cross-country average for that day: if a country's rate of increase exceeds the average rate, the curve goes up, and vice versa.
The sign of a value indicates whether the country is above (positive) or below (negative) the mean across all countries on that day.
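To make the sign interpretation concrete, here is a toy standardization of a single "day" (column) across four hypothetical countries; the numbers are invented for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One invented "day" of vaccination levels for four countries
day = np.array([[0.10], [0.30], [0.50], [0.90]])
z = StandardScaler().fit_transform(day)

# The raw values are all positive, yet the standardized values straddle zero:
# above the day's mean (0.45) -> positive, below it -> negative
print(z.ravel())
```

The standardized column has mean 0 and StD 1, so positive and negative values are purely relative to the other countries on that day.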
Write a function `get_sorted_eigen(df_cov)` that gets the covariance matrix of dataframe df (from step 1), and returns sorted eigenvalues and eigenvectors using np.linalg.eigh. [0.25]
# Step 1: Covariance Matrix
cases_std_cov = cases_std.cov()
print('The Shape of Covariance Matrix:', cases_std_cov.shape)
# Step 2: Function defined
def get_sorted_eigen(df_cov):
    # Compute eigenvalues and eigenvectors using np.linalg.eigh
    eigenvalues, eigenvectors = np.linalg.eigh(df_cov)
    # Sort the eigenvalues (and matching eigenvectors) in descending order
    idx = eigenvalues.argsort()[::-1]
    eigenvalues = eigenvalues[idx]
    eigenvectors = eigenvectors[:, idx]
    # Return sorted eigenvalues and eigenvectors
    return eigenvalues, eigenvectors
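A quick sanity check on the sorting step (random toy data, not the COVID dataset): `np.linalg.eigh` returns eigenvalues in ascending order, which is why the function flips them to descending for PCA.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
cov = np.cov(X, rowvar=False)

vals, vecs = np.linalg.eigh(cov)   # eigh: ascending eigenvalues
order = vals.argsort()[::-1]       # flip to descending for PCA
vals, vecs = vals[order], vecs[:, order]

# Descending order, and the eigenpairs still reproduce the covariance matrix
assert np.all(np.diff(vals) <= 0)
assert np.allclose(vecs @ np.diag(vals) @ vecs.T, cov)
```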
eigenvalues, eigenvectors = get_sorted_eigen(cases_std_cov)
# Step 3: Scree Plot
# Total Variance of the dataset
total_var = np.sum(eigenvalues)
# Explained variance for each principal component (PC)
explained_var = (eigenvalues / total_var) * 100
explained_var10 = explained_var[:10]
# Cumulative explained variance
cumulative_var = np.cumsum(explained_var)
cumulative_var10 = cumulative_var[:10]
# Scree plot with cumulative explained variance
plt.figure(figsize = (16, 9), dpi = 100)
plt.bar(range(1, 11), explained_var10, label='Explained Variance')
plt.plot(range(1, 11), cumulative_var10, 'r-x', markersize = 10, linewidth = 2, label='Cumulative Explained Variance')
plt.title('Scree Plot')
plt.xlabel('Number of Principal Components')
plt.ylabel('Explained Variance (%)')
plt.legend()
plt.show()
# Step 4: Number of PCs for 99% Variance
n_components = np.argmax(cumulative_var > 99) + 1  # first index past 99%, +1 converts index to count
print('Number of PCs needed to cover 99% Variance:', n_components)
The Shape of Covariance Matrix: (483, 483)
Number of PCs needed to cover 99% Variance: 8
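The component count can be illustrated on toy cumulative-explained-variance numbers (invented for illustration):

```python
import numpy as np

# Invented cumulative explained variance, in percent
cumulative = np.array([62.0, 81.0, 93.5, 97.2, 99.4, 99.9])

# argmax on the boolean mask finds the first index past the threshold;
# +1 turns the zero-based index into a component count
n = np.argmax(cumulative > 99) + 1
print(n)  # -> 5
```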
# Step 5: 16 PCs
# Plotting the PCs
fig, axs = plt.subplots(16, figsize=(28, 100), dpi = 100)
for i in range(16):
    axs[i].set_title('PC' + str(i+1) + ' v/s Date')
    axs[i].plot(cases_raw.columns, eigenvectors[:, i])
    axs[i].xaxis.set_major_locator(plt.MaxNLocator(20))
    axs[i].set_xlabel('Date')
    axs[i].set_ylabel('Eigenvector ' + str(i+1))
plt.tight_layout(rect=[0.01, 0.01, 1, 0.97])
plt.show()
Compare the first two PCs with the rest of them. Do you see any difference in their trend? [0.5]
The first 2 PCs have a generally decreasing trend, while the rest curve upward and fluctuate more. The first two are also relatively smoother than the others, likely because they capture the broad trend of the entire dataset.
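One caveat when reading these trends: the overall sign of each PC is arbitrary, since flipping an eigenvector leaves it an eigenvector. A small check with a toy matrix:

```python
import numpy as np

# If v satisfies C v = lambda v, then so does -v: a "decreasing" PC curve
# could just as well be plotted flipped upward
C = np.array([[2.0, 1.0], [1.0, 2.0]])
vals, vecs = np.linalg.eigh(C)
v = vecs[:, -1]

assert np.allclose(C @ v, vals[-1] * v)
assert np.allclose(C @ -v, vals[-1] * -v)
```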
Create a function that:
Plots 4 figures:
The incremental reconstruction of the original (not standardized) time-series for the specified country in a single plot. [1.5]
You should at least show 5 curves in a figure for incremental reconstruction. For example, you can pick the following (or any other combination that you think is reasonable):
Hint: you need to compute the reconstruction for the standardized time-series first, and then scale it back to the original (non-standardized) form using the StandardScaler `inverse_transform` method.
The residual error of the best reconstruction (df - df_reconstructed). On the x-axis, you have dates, and on the y-axis, the residual error. Test your function using the United States, Canada, and China as inputs. [0.5]
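A minimal round-trip sketch of this hint on invented toy data (the reconstruction is exact here only because all components are kept):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data with a different scale per column
rng = np.random.default_rng(1)
df = rng.normal(size=(20, 5)) * np.array([1, 2, 3, 4, 5]) + 10.0

scaler = StandardScaler()
X = scaler.fit_transform(df)
vals, vecs = np.linalg.eigh(np.cov(X, rowvar=False))
W = vecs[:, ::-1]            # all components, descending order

X_rec = (X @ W) @ W.T        # project, then reconstruct in standardized space
back = scaler.inverse_transform(X_rec)  # scale back to the original units

# Keeping every component, the round trip is exact
assert np.allclose(back, df)
```

Truncating `W` to the first k columns gives the lossy k-component reconstruction used below.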
def plot_country_figures(original_df, country_name):
    # Standardize and compute sorted eigenpairs of the covariance matrix
    scaler = StandardScaler()
    df_std = scaler.fit_transform(original_df)
    df_std = pd.DataFrame(df_std, columns = original_df.columns, index = original_df.index)
    df_cov = df_std.cov()
    eigenvalues, eigenvectors = get_sorted_eigen(df_cov)
    fig, axs = plt.subplots(4, 1, figsize = (28, 26), dpi = 100)
    fig.tight_layout(pad = 3.0)
    # Original Data Plot
    df_country = original_df[original_df.index == country_name].sum()
    axs[0].plot(df_country.index, df_country.values)
    axs[0].set_title(country_name)
    axs[0].set_xlabel('Date')
    axs[0].set_ylabel('Total Cases')
    axs[0].xaxis.set_major_locator(plt.MaxNLocator(20))
    # Incremental Reconstruction
    axs[1].plot(df_country.index, df_country.values, label = 'Original')
    PCs = [1, 2, 4, 8, 16, 32, 64]
    for pc in PCs:
        W = eigenvectors[:, :pc]
        reconstructed_std = np.dot(np.dot(df_std, W), W.T)  # project, then reconstruct
        reconstructed_original = scaler.inverse_transform(reconstructed_std)
        reconstructed_raw = pd.DataFrame(reconstructed_original, columns = original_df.columns, index = original_df.index)
        axs[1].plot(original_df.columns, reconstructed_raw.loc[country_name].values, label = 'PC' + str(pc))
    axs[1].legend()
    axs[1].set_title('Incremental Reconstruction for ' + country_name)
    axs[1].set_xlabel('Date')
    axs[1].set_ylabel('Reconstructed Number of Cases')
    axs[1].xaxis.set_major_locator(plt.MaxNLocator(20))
    # Residual Error of the best (64-PC) reconstruction from the loop above
    residual_error = original_df.loc[country_name] - reconstructed_raw.loc[country_name]
    axs[2].plot(original_df.columns, residual_error)
    axs[2].set_title('Residual Error of Best Reconstruction for ' + country_name)
    axs[2].set_xlabel('Date')
    axs[2].set_ylabel('Residual Error')
    axs[2].xaxis.set_major_locator(plt.MaxNLocator(20))
    # RMSE for 1 to 10 components
    RMSE = []
    for k in range(1, 11):
        W = eigenvectors[:, :k]
        reconstructed_std = np.dot(np.dot(df_std, W), W.T)
        reconstructed_original = scaler.inverse_transform(reconstructed_std)
        reconstructed_raw = pd.DataFrame(reconstructed_original, columns = original_df.columns, index = original_df.index)
        n = reconstructed_raw.loc[country_name].size
        SE = np.square(original_df.loc[country_name] - reconstructed_raw.loc[country_name]).sum() / n
        RMSE.append(math.sqrt(SE))
    axs[3].plot(range(1, 11), RMSE)
    axs[3].set_title('RMSE with Number of Components for ' + country_name)
    axs[3].set_xlabel('Number of Components')
    axs[3].set_ylabel('RMSE')
    return RMSE
# Empty 3x10 numpy array
RMSE_PCA = np.zeros((3,10))
RMSE_PCA[0] = plot_country_figures(cases_raw, 'United States')
RMSE_PCA[1] = plot_country_figures(cases_raw, 'Canada')
RMSE_PCA[2] = plot_country_figures(cases_raw, 'China')
Modify your code in part 3 to use SVD instead of PCA for extracting the eigenvectors. [1]
Explain if standardization or covariance computation is required for this part. Repeat part 3 and compare your PCA and SVD results. Write a function to make this comparison [0.5], and comment on the results. [0.5].
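Before modifying the code, the truncation idea can be sketched in isolation on a random toy matrix (`rank_k` is a helper defined here for illustration, not part of the handout):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(6, 8))
U, S, Vt = np.linalg.svd(A, full_matrices=False)

def rank_k(k):
    # Keep only the k largest singular values and their vectors
    return (U[:, :k] * S[:k]) @ Vt[:k, :]

# Keeping more singular values can only reduce the reconstruction error
errs = [np.linalg.norm(A - rank_k(k)) for k in range(1, 7)]
assert all(b <= a + 1e-12 for a, b in zip(errs, errs[1:]))
```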
def plot_country_figures_SVD(original_df, country_name):
    # SVD of the raw data matrix (no standardization or covariance needed)
    U, S, V = np.linalg.svd(original_df)
    fig, axs = plt.subplots(4, 1, figsize = (28, 26), dpi = 100)
    fig.tight_layout(pad = 3.0)
    # Original Data Plot
    df_country = original_df[original_df.index == country_name].sum()
    axs[0].plot(df_country.index, df_country.values)
    axs[0].set_title(country_name)
    axs[0].set_xlabel('Date')
    axs[0].set_ylabel('Total Cases')
    axs[0].xaxis.set_major_locator(plt.MaxNLocator(20))
    # Incremental Reconstruction
    axs[1].plot(df_country.index, df_country.values, label = 'Original')
    SVs = [1, 2, 4, 8, 16, 32, 64]
    for sv in SVs:
        reconstructed_original = (U[:, :sv] * S[:sv]) @ V[:sv, :]
        reconstructed_raw = pd.DataFrame(reconstructed_original, columns = original_df.columns, index = original_df.index)
        axs[1].plot(original_df.columns, reconstructed_raw.loc[country_name].values, label = 'SV' + str(sv))
    axs[1].legend()
    axs[1].set_title('Incremental Reconstruction for ' + country_name)
    axs[1].set_xlabel('Date')
    axs[1].set_ylabel('Reconstructed Number of Cases')
    axs[1].xaxis.set_major_locator(plt.MaxNLocator(20))
    # Residual Error of the best (64-SV) reconstruction from the loop above
    residual_error = original_df.loc[country_name] - reconstructed_raw.loc[country_name]
    axs[2].plot(original_df.columns, residual_error)
    axs[2].set_title('Residual Error of Best Reconstruction for ' + country_name)
    axs[2].set_xlabel('Date')
    axs[2].set_ylabel('Residual Error')
    axs[2].xaxis.set_major_locator(plt.MaxNLocator(20))
    # RMSE for 1 to 10 singular vectors
    RMSE = []
    for k in range(1, 11):
        reconstructed_original = (U[:, :k] * S[:k]) @ V[:k, :]
        reconstructed_raw = pd.DataFrame(reconstructed_original, columns = original_df.columns, index = original_df.index)
        n = reconstructed_raw.loc[country_name].size
        SE = np.square(original_df.loc[country_name] - reconstructed_raw.loc[country_name]).sum() / n
        RMSE.append(math.sqrt(SE))
    axs[3].plot(range(1, 11), RMSE)
    axs[3].set_title('RMSE with Number of Components for ' + country_name)
    axs[3].set_xlabel('Number of Components')
    axs[3].set_ylabel('RMSE')
    return RMSE
# Empty 3x10 numpy array
RMSE_SVD = np.zeros((3,10))
RMSE_SVD[0] = plot_country_figures_SVD(cases_raw, 'United States')
RMSE_SVD[1] = plot_country_figures_SVD(cases_raw, 'Canada')
RMSE_SVD[2] = plot_country_figures_SVD(cases_raw, 'China')
def pca_vs_svd(pca, svd, country_list):
    # Per-country comparison of the two RMSE curves
    for i in range(3):
        plt.figure(figsize = (28, 8), dpi = 100)
        plt.plot(range(1, 11), pca[i], label = 'PCA')
        plt.plot(range(1, 11), svd[i], label = 'SVD')
        plt.title(f'RMSE of PCA and SVD for {country_list[i]}')
        plt.xlabel('Components Used')
        plt.ylabel('RMSE')
        plt.legend()
        plt.show()
    # Difference in RMSE (PCA minus SVD), all countries on one figure
    plt.figure(figsize = (28, 8), dpi = 100)
    for i in range(3):
        plt.plot(range(1, 11), pca[i] - svd[i], label = country_list[i])
    plt.title('Difference in RMSE (PCA - SVD)')
    plt.xlabel('Components Used')
    plt.ylabel('RMSE Difference')
    plt.legend()
    plt.show()
pca_vs_svd(RMSE_PCA, RMSE_SVD, ['United States', 'Canada', 'China'])
Standardization and covariance computation are not required for SVD: SVD factorizes the data matrix directly into singular values and singular vectors, whereas PCA derives eigenvalues and eigenvectors from the covariance matrix of the standardized data. This also makes the SVD reconstruction independent of any standardization or scaling step.
The leading singular values and vectors capture the dominant patterns and structures in the data, so even without standardization or a covariance matrix, SVD still extracts the most important information.
For the three countries compared, SVD needed fewer components than PCA to reach the same RMSE. On a massive dataset this makes it more efficient, since fewer components mean less storage and less computation.
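For context, the two routes coincide once the data is mean-centered; a small demonstration with random toy data:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
Xc = X - X.mean(axis=0)   # mean-center each column

# Eigen-route (PCA on the covariance) vs SVD of the centered matrix
eigvals = np.linalg.eigh(np.cov(Xc, rowvar=False))[0][::-1]
s = np.linalg.svd(Xc, compute_uv=False)

# The covariance eigenvalues equal s_i^2 / (n - 1)
assert np.allclose(eigvals, s**2 / (len(Xc) - 1))
```

In this notebook the SVD is applied to the raw, uncentered matrix, so its components differ from the PCA ones; the equality above holds only after centering.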
Create another dataset similar to the one provided in your handout using the raw information on COVID deaths (not vaccinations) provided here. [1]
You need to manipulate the data to organize it in the desired format (i.e., the same format that was in previous parts). You are free to use any tools you like, from Excel to Python!
In the end, you should have a new CSV file with more dates (features) compared to the provided dataset.
Upload your new dataset (in CSV format) to your Colab notebook and repeat part 4. When analyzing Canada, you may use just Ontario. For China, you may use Beijing. [1]
Don't forget to add your new CSV file to Quercus for reference. The code below helps you upload your new CSV file to your Colab session.
# load train.csv to Google Colab
# from google.colab import files
# uploaded = files.upload()
key_countries = pd.read_csv('key-countries-pivoted_csv.csv')
key_countries = key_countries.set_index('country').transpose()
key_countries
| country | 22/01/2020 | 23/01/2020 | 24/01/2020 | 25/01/2020 | 26/01/2020 | 27/01/2020 | 28/01/2020 | 29/01/2020 | 30/01/2020 | 31/01/2020 | ... | 14/01/2022 | 15/01/2022 | 16/01/2022 | 17/01/2022 | 18/01/2022 | 19/01/2022 | 20/01/2022 | 21/01/2022 | 22/01/2022 | 23/01/2022 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| China | 548 | 643 | 920 | 1406 | 2075 | 2877 | 5509 | 6087 | 8141 | 9802 | ... | 117678 | 117800 | 118030 | 118204 | 118297 | 118370 | 118470 | 118544 | 118616 | 118773 |
| US | 1 | 1 | 2 | 2 | 5 | 5 | 5 | 6 | 6 | 8 | ... | 65069619 | 65463200 | 65930556 | 66590148 | 67693339 | 68684431 | 69329860 | 70209840 | 70495874 | 70699416 |
| United_Kingdom | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | ... | 15165209 | 15246179 | 15316611 | 15407216 | 15501850 | 15610069 | 15718193 | 15814617 | 15891905 | 15966838 |
| Italy | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | ... | 8356514 | 8549450 | 8706915 | 8790302 | 9018425 | 9219391 | 9418256 | 9603856 | 9781191 | 9923678 |
| France | 0 | 0 | 2 | 3 | 3 | 3 | 4 | 5 | 5 | 5 | ... | 13680775 | 14005385 | 14283514 | 14386178 | 14851448 | 15288014 | 15715329 | 16116748 | 16506090 | 16807733 |
| Germany | 0 | 0 | 0 | 0 | 0 | 1 | 4 | 4 | 4 | 5 | ... | 7943959 | 7988210 | 8021339 | 8104157 | 8222262 | 8361262 | 8502132 | 8635461 | 8716804 | 8773030 |
| Spain | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 8093036 | 8093036 | 8093036 | 8424503 | 8518975 | 8676916 | 8834363 | 8975458 | 8975458 | 8975458 |
| Iran | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 6217320 | 6218741 | 6221033 | 6224196 | 6227849 | 6231909 | 6236567 | 6241843 | 6245346 | 6250490 |
8 rows × 733 columns
def plot_country_figures(original_df, country_name):
    # Standardize and compute sorted eigenpairs of the covariance matrix
    scaler = StandardScaler()
    df_std = scaler.fit_transform(original_df)
    df_std = pd.DataFrame(df_std, columns = original_df.columns, index = original_df.index)
    df_cov = df_std.cov()
    eigenvalues, eigenvectors = get_sorted_eigen(df_cov)
    fig, axs = plt.subplots(4, 1, figsize = (28, 26), dpi = 100)
    fig.tight_layout(pad = 3.0)
    # Original Data Plot
    df_country = original_df[original_df.index == country_name].sum()
    axs[0].plot(df_country.index, df_country.values)
    axs[0].set_title(country_name)
    axs[0].set_xlabel('Date')
    axs[0].set_ylabel('Total Cases')
    axs[0].xaxis.set_major_locator(plt.MaxNLocator(20))
    # Incremental Reconstruction (only 8 countries, so at most 8 PCs)
    axs[1].plot(df_country.index, df_country.values, label = 'Original')
    PCs = [1, 2, 4, 8]
    for pc in PCs:
        W = eigenvectors[:, :pc]
        reconstructed_std = np.dot(np.dot(df_std, W), W.T)
        reconstructed_original = scaler.inverse_transform(reconstructed_std)
        reconstructed_raw = pd.DataFrame(reconstructed_original, columns = original_df.columns, index = original_df.index)
        axs[1].plot(original_df.columns, reconstructed_raw.loc[country_name].values, label = 'PC' + str(pc))
    axs[1].legend()
    axs[1].set_title('Incremental Reconstruction for ' + country_name)
    axs[1].set_xlabel('Date')
    axs[1].set_ylabel('Reconstructed Number of Cases')
    axs[1].xaxis.set_major_locator(plt.MaxNLocator(20))
    # Residual Error of the best (8-PC) reconstruction from the loop above
    residual_error = original_df.loc[country_name] - reconstructed_raw.loc[country_name]
    axs[2].plot(original_df.columns, residual_error)
    axs[2].set_title('Residual Error of Best Reconstruction for ' + country_name)
    axs[2].set_xlabel('Date')
    axs[2].set_ylabel('Residual Error')
    axs[2].xaxis.set_major_locator(plt.MaxNLocator(20))
    # RMSE for 1 to 8 components
    RMSE = []
    for k in range(1, 9):
        W = eigenvectors[:, :k]
        reconstructed_std = np.dot(np.dot(df_std, W), W.T)
        reconstructed_original = scaler.inverse_transform(reconstructed_std)
        reconstructed_raw = pd.DataFrame(reconstructed_original, columns = original_df.columns, index = original_df.index)
        n = reconstructed_raw.loc[country_name].size
        SE = np.square(original_df.loc[country_name] - reconstructed_raw.loc[country_name]).sum() / n
        RMSE.append(math.sqrt(SE))
    axs[3].plot(range(1, 9), RMSE)
    axs[3].set_title('RMSE with Number of Components for ' + country_name)
    axs[3].set_xlabel('Number of Components')
    axs[3].set_ylabel('RMSE')
    return RMSE
# 3x8 numpy array to collect RMSE values
RMSE_PCA = np.zeros((3,8))
RMSE_PCA[0] = plot_country_figures(key_countries, 'US')
RMSE_PCA[1] = plot_country_figures(key_countries, 'United_Kingdom')
RMSE_PCA[2] = plot_country_figures(key_countries, 'China')
def plot_country_figures_SVD(original_df, country_name):
    # SVD of the raw data matrix (no standardization or covariance needed)
    U, S, V = np.linalg.svd(original_df)
    fig, axs = plt.subplots(4, 1, figsize = (28, 26), dpi = 100)
    fig.tight_layout(pad = 3.0)
    # Original Data Plot
    df_country = original_df[original_df.index == country_name].sum()
    axs[0].plot(df_country.index, df_country.values)
    axs[0].set_title(country_name)
    axs[0].set_xlabel('Date')
    axs[0].set_ylabel('Total Cases')
    axs[0].xaxis.set_major_locator(plt.MaxNLocator(20))
    # Incremental Reconstruction (only 8 countries, so at most 8 SVs)
    axs[1].plot(df_country.index, df_country.values, label = 'Original')
    SVs = [1, 2, 4, 8]
    for sv in SVs:
        reconstructed_original = (U[:, :sv] * S[:sv]) @ V[:sv, :]
        reconstructed_raw = pd.DataFrame(reconstructed_original, columns = original_df.columns, index = original_df.index)
        axs[1].plot(original_df.columns, reconstructed_raw.loc[country_name].values, label = 'SV' + str(sv))
    axs[1].legend()
    axs[1].set_title('Incremental Reconstruction for ' + country_name)
    axs[1].set_xlabel('Date')
    axs[1].set_ylabel('Reconstructed Number of Cases')
    axs[1].xaxis.set_major_locator(plt.MaxNLocator(20))
    # Residual Error of the best (8-SV) reconstruction from the loop above
    residual_error = original_df.loc[country_name] - reconstructed_raw.loc[country_name]
    axs[2].plot(original_df.columns, residual_error)
    axs[2].set_title('Residual Error of Best Reconstruction for ' + country_name)
    axs[2].set_xlabel('Date')
    axs[2].set_ylabel('Residual Error')
    axs[2].xaxis.set_major_locator(plt.MaxNLocator(20))
    # RMSE for 1 to 8 singular vectors
    RMSE = []
    for k in range(1, 9):
        reconstructed_original = (U[:, :k] * S[:k]) @ V[:k, :]
        reconstructed_raw = pd.DataFrame(reconstructed_original, columns = original_df.columns, index = original_df.index)
        n = reconstructed_raw.loc[country_name].size
        SE = np.square(original_df.loc[country_name] - reconstructed_raw.loc[country_name]).sum() / n
        RMSE.append(math.sqrt(SE))
    axs[3].plot(range(1, 9), RMSE)
    axs[3].set_title('RMSE with Number of Components for ' + country_name)
    axs[3].set_xlabel('Number of Components')
    axs[3].set_ylabel('RMSE')
    return RMSE
# 3x8 numpy array to collect RMSE values
RMSE_SVD = np.zeros((3,8))
RMSE_SVD[0] = plot_country_figures_SVD(key_countries, 'US')
RMSE_SVD[1] = plot_country_figures_SVD(key_countries, 'United_Kingdom')
RMSE_SVD[2] = plot_country_figures_SVD(key_countries, 'China')
def pca_vs_svd(pca, svd, country_list):
    # Per-country comparison of the two RMSE curves
    for i in range(3):
        plt.figure(figsize = (28, 8), dpi = 100)
        plt.plot(range(1, 9), pca[i], label = 'PCA')
        plt.plot(range(1, 9), svd[i], label = 'SVD')
        plt.title(f'RMSE of PCA and SVD for {country_list[i]}')
        plt.xlabel('Components Used')
        plt.ylabel('RMSE')
        plt.legend()
        plt.show()
    # Difference in RMSE (PCA minus SVD), all countries on one figure
    plt.figure(figsize = (28, 8), dpi = 100)
    for i in range(3):
        plt.plot(range(1, 9), pca[i] - svd[i], label = country_list[i])
    plt.title('Difference in RMSE (PCA - SVD)')
    plt.xlabel('Components Used')
    plt.ylabel('RMSE Difference')
    plt.legend()
    plt.show()
pca_vs_svd(RMSE_PCA, RMSE_SVD, ['US', 'United_Kingdom', 'China'])
Understanding PCA and SVD:
https://towardsdatascience.com/pca-and-svd-explained-with-numpy-5d13b0d2a4d8
https://hadrienj.github.io/posts/Deep-Learning-Book-Series-2.8-Singular-Value-Decomposition/
PCA:
Covid Data: